18.1 The Log-Likelihood Gradient
Target probability distribution:
\[p(x; \theta) = \frac{1}{Z(\theta)}\hat{p}(x; \theta)\]
The gradient of the log-likelihood with respect to the parameters:
\[\nabla_{\theta} \log p(x; \theta) = \nabla_{\theta} \log\hat{p}(x;\theta) - \nabla_{\theta} \log Z(\theta)\]
- \(\nabla_{\theta} \log\hat{p}(x;\theta)\): Positive phase. Models with no latent variables, or with few interactions between the latent variables, typically have a tractable positive phase.
- \(\nabla_{\theta} \log Z(\theta)\): Negative phase. For undirected models, the negative phase is difficult.
A closer look at the gradient of \(\log Z\) for discrete \(x\), assuming \(p(x) > 0\) for all \(x\):
\[\begin{split}
\nabla_{\theta} \log Z &= \frac{\nabla_{\theta} Z}{Z} \\
&= \frac{\nabla_{\theta}\sum_x \hat{p}(x)}{Z} \\
&= \frac{\sum_x \nabla_{\theta} \hat{p}(x)}{Z} \\
&= \frac{\sum_x \hat{p}(x) \frac{\nabla_{\theta} \hat{p}(x)}{\hat{p}(x)} }{Z} \\
&= \frac{\sum_x \hat{p}(x) \nabla_{\theta} \log \hat{p}(x) }{Z} \\
&= \sum_x p(x) \nabla_{\theta} \log \hat{p}(x) \\
&= E_{x\sim p(x)} [\nabla_{\theta} \log\hat{p}(x)]
\end{split}\]
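The identity \(\nabla_{\theta} \log Z = E_{x\sim p(x)}[\nabla_{\theta} \log\hat{p}(x)]\) can be checked numerically on a model small enough to enumerate. The sketch below assumes a hypothetical log-linear model, \(\log\hat{p}(x;\theta) = \theta^\top \phi(x)\), over six states with a random feature table `phi`. None of these names come from the text; they are chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy model: 6 states, 3 features per state,
# with log p_hat(x; theta) = theta . phi(x).
phi = rng.normal(size=(6, 3))
theta = rng.normal(size=3)

def log_Z(theta):
    # Log partition function: sum p_hat over every state, then take the log.
    return np.log(np.sum(np.exp(phi @ theta)))

# Right-hand side: E_{x ~ p}[grad log p_hat(x)].
# For this model, grad log p_hat(x) is simply phi(x).
p = np.exp(phi @ theta - log_Z(theta))
expected_grad = p @ phi

# Left-hand side: grad log Z, approximated by central finite differences.
eps = 1e-6
fd_grad = np.array([
    (log_Z(theta + eps * e) - log_Z(theta - eps * e)) / (2 * eps)
    for e in np.eye(3)
])

print(expected_grad)
print(fd_grad)
```

The two printed vectors should agree up to finite-difference error. In models where the states cannot be enumerated, `log_Z` is unavailable and only the expectation form can be estimated, by sampling.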
This identity is the basis for a variety of Monte Carlo methods for approximately maximizing the likelihood of models with intractable partition functions.
- In the positive phase, we increase \(\log \hat{p}(x)\) for \(x\) drawn from the data.
- In the negative phase, we decrease the partition function by decreasing \(\log \hat{p}(x)\) for \(x\) drawn from the model distribution.
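The two phases can be sketched as a stochastic gradient ascent loop. This is a minimal illustration rather than a practical training method: it reuses a hypothetical log-linear model with feature table `phi`, and it draws the negative-phase samples exactly from the model, which is possible only because the toy state space is tiny. In realistic models those samples would have to come from an approximate sampler. The names `true_theta`, `model_probs`, and `gradient_estimate` are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
phi = rng.normal(size=(6, 3))            # hypothetical features, one row per state

def model_probs(theta):
    # Exact p(x; theta); feasible only because there are just 6 states.
    logits = phi @ theta
    logits -= logits.max()               # subtract the max for numerical stability
    w = np.exp(logits)
    return w / w.sum()

# Hypothetical training set: state indices sampled from a "true" model.
true_theta = np.array([1.5, -1.0, 0.5])
data = rng.choice(6, size=500, p=model_probs(true_theta))

def avg_log_likelihood(theta):
    return np.mean(np.log(model_probs(theta)[data]))

def gradient_estimate(theta, n_model_samples=5000):
    # Positive phase: average of grad log p_hat over the data.
    positive = phi[data].mean(axis=0)
    # Negative phase: average of grad log p_hat over samples from the model.
    samples = rng.choice(6, size=n_model_samples, p=model_probs(theta))
    negative = phi[samples].mean(axis=0)
    return positive - negative

theta = np.zeros(3)
before = avg_log_likelihood(theta)
for _ in range(200):
    theta += 0.1 * gradient_estimate(theta)   # ascent step on the log-likelihood
after = avg_log_likelihood(theta)

print(before, after)
```

The average log-likelihood after training should be higher than before, since each step pushes \(\log\hat{p}\) up on data points and down on model samples.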
Review of Monte Carlo methods:
The idea is to view a sum or integral as an expectation under some distribution and to approximate that expectation by the corresponding sample average. The sum or integral to estimate is:
\[\begin{split}s = \sum_x p(x)f(x) = E_p[f(x)] \\
\text{or} \\
s = \int p(x)f(x)\,dx = E_p[f(x)]\end{split}\]
We can approximate \(s\) by drawing \(n\) samples \(x^{(1)}, x^{(2)}, \ldots, x^{(n)}\) from \(p\) and then forming the empirical average
\[\hat{s}_n = \frac{1}{n} \sum_{i=1}^{n}f(x^{(i)})\]
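As a small illustration of the empirical average, the sketch below estimates \(E_p[f(x)]\) for an example chosen here, not taken from the text: \(p\) is a standard normal and \(f(x) = x^2\), so the exact value is \(s = 1\). The helper name `mc_estimate` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_estimate(f, sampler, n):
    # Draw n samples from p via `sampler`, then average f over them.
    x = sampler(n)
    return np.mean(f(x))

# p = standard normal, f(x) = x^2, so the exact expectation is 1.
s_hat = mc_estimate(lambda x: x ** 2, rng.standard_normal, 100_000)
print(s_hat)
```

With independent samples, the error of \(\hat{s}_n\) shrinks roughly like \(1/\sqrt{n}\), so the printed value should be close to 1 but not exactly equal to it.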